D2GRs1 Environment Setup

Experiment environment

generative-recommenders/requirements.txt

Custom image

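Reference: "Example: Creating a Custom Image from Scratch and Using It for Training (PyTorch + CPU/GPU)", ModelArts AI development platform documentation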

Install Docker: https://docs.docker.com/engine/install/centos/
Pull the base image

Common AI base images and launch commands

One option is the official PyTorch image:

docker pull --platform=linux/amd64 pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel

Another option is NVIDIA's official PyTorch image:

docker pull --platform=linux/amd64 nvcr.io/nvidia/pytorch:23.05-py3

The NVIDIA (CUDA) packages in requirements are version 12.1 and tensorrt=8.6.1, so the 23.05 release is used here. Release notes: https://docs.nvidia.com/deeplearning/frameworks/pytorch-release-notes/rel-24-03.html

https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch/tags

A third option is the CUDA image:

docker pull --platform=linux/amd64 nvidia/cuda:12.1.0-cudnn8-devel-ubi8
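
The notes above list only the pull commands for these images. A minimal sketch of starting an interactive container from one of them follows. It assumes the NVIDIA Container Toolkit is installed on the host so that `--gpus all` works. The helper names `run` and `start_container` and the `/workspace` mount point are hypothetical and do not come from the original notes.

```shell
#!/usr/bin/env bash

# Print the command instead of executing it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: start_container IMAGE [HOST_DIR]
# Mounts HOST_DIR (default: current directory) at /workspace and opens a shell.
start_container() {
  local image="$1" host_dir="${2:-$PWD}"
  run docker run --gpus all -it --rm \
    -v "${host_dir}:/workspace" -w /workspace \
    "$image" bash
}
```

For example, `DRY_RUN=1 start_container pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel` prints the command for inspection, and the same call without `DRY_RUN` launches the container.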

Main issue: fbgemm-gpu

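The environment has CUDA 11.8, but pip installs the cu12 build of fbgemm-gpu by default, which fails with errors about missing CUDA 12 files.
Download the cu118 wheel from https://download.pytorch.org/whl/cu118/fbgemm-gpu/
The wheel version must match PyTorch: for PyTorch 2.2.2 or later, reinstall torch as the matching cu118 build as well.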
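
The reinstall described above can be sketched as a pair of shell helpers that derive the wheel index from the CUDA version and print the pip commands without running them. The function names `cuda_tag` and `print_reinstall_cmds` are hypothetical. The fbgemm-gpu version is deliberately a required argument, because it has to be read off the wheel listing for the chosen torch release rather than guessed.

```shell
#!/usr/bin/env bash

# Map a CUDA version such as "11.8" or "11.8.0" to a wheel tag such as "cu118".
cuda_tag() {
  local major minor
  IFS=. read -r major minor _ <<< "$1"
  echo "cu${major}${minor}"
}

# Print (do not run) the reinstall commands.
# Usage: print_reinstall_cmds CUDA_VERSION TORCH_VERSION FBGEMM_VERSION
print_reinstall_cmds() {
  local cuda="$1" torch_ver="$2" fbgemm_ver="$3"
  local index
  index="https://download.pytorch.org/whl/$(cuda_tag "$cuda")"
  echo "pip uninstall -y torch fbgemm-gpu"
  echo "pip install torch==${torch_ver} --index-url ${index}"
  echo "pip install fbgemm-gpu==${fbgemm_ver} --index-url ${index}"
}
```

Calling `print_reinstall_cmds 11.8 2.2.2 "$FBGEMM_VERSION"` prints three commands to review before running them. Afterwards, `python -c "import torch; print(torch.version.cuda)"` should report 11.8 if the cu118 build is the one in use.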

pip install gin-config absl-py -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install tensorboard -i https://pypi.tuna.tsinghua.edu.cn/simple

pip install pandas -i https://pypi.tuna.tsinghua.edu.cn/simple
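
To bake these extra packages into a custom image instead of installing them by hand in each container, one possible sketch is a helper that writes a small Dockerfile on top of a chosen base image. The function name `write_dockerfile` is hypothetical, and the package list simply mirrors the pip commands above. The fbgemm-gpu and torch reinstall is not included here.

```shell
#!/usr/bin/env bash

# Write a Dockerfile that layers the extra Python packages on a base image.
# Usage: write_dockerfile BASE_IMAGE [OUTPUT_PATH]
write_dockerfile() {
  local base="$1" out="${2:-Dockerfile}"
  cat > "$out" <<EOF
FROM ${base}
RUN pip install --no-cache-dir -i https://pypi.tuna.tsinghua.edu.cn/simple gin-config absl-py tensorboard pandas
EOF
}
```

After `write_dockerfile pytorch/pytorch:2.2.2-cuda12.1-cudnn8-devel`, the image can be built with `docker build --platform=linux/amd64 -t d2grs1-env:dev .`, where the tag `d2grs1-env:dev` is only an illustrative name.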